[DPE-7726] Use Patroni API for is_restart_pending() (instead of SQL select from pg_settings) #1049

Draft
wants to merge 14 commits into
base: 16/edge
from

Conversation

taurus-forever
Contributor

Issue

The previous is_restart_pending() waited for 15 seconds because of
Patroni's loop_wait default value (10 seconds), which sets how long
Patroni waits before checking the configuration file again to reload it.

Solution

Instead of checking PostgreSQL pending_restart from pg_settings,
check the pending_restart flag in the Patroni API (True when a restart is pending, undefined otherwise).
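
As a minimal sketch of the check described above (not the charm's actual code): the helper names `is_restart_pending` and `fetch_member_status` and the `base_url` parameter are illustrative assumptions. It assumes the member status is read from Patroni's `GET /patroni` REST endpoint and that the `pending_restart` key is simply absent when no restart is pending.

```python
import json
from urllib.request import urlopen


def is_restart_pending(member_status: dict) -> bool:
    """Return True only when Patroni reports pending_restart as true.

    Assumption: the key is missing (undefined) when no restart is pending,
    so a missing key is treated as False.
    """
    return member_status.get("pending_restart") is True


def fetch_member_status(base_url: str, timeout: float = 5.0) -> dict:
    """Fetch the member status JSON from the Patroni REST API.

    `base_url` is a hypothetical value such as "http://10.0.0.1:8008".
    """
    with urlopen(f"{base_url}/patroni", timeout=timeout) as response:
        return json.load(response)
```

Keeping the parsing separate from the HTTP call makes the flag logic easy to unit test without a running Patroni.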

Checklist

  • I have added or updated any relevant documentation.
  • I have cleaned any remaining cloud resources from my accounts.


codecov bot commented Jul 15, 2025

Codecov Report

❌ Patch coverage is 29.16667% with 34 lines in your changes missing coverage. Please review.
✅ Project coverage is 62.40%. Comparing base (ee02d5a) to head (ee8e44b).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/charm.py | 32.00% | 15 Missing and 2 partials ⚠️ |
| src/cluster.py | 28.57% | 15 Missing ⚠️ |
| src/relations/async_replication.py | 0.00% | 2 Missing ⚠️ |

❌ Your patch check has failed because the patch coverage (29.16%) is below the target coverage (33.00%). You can increase the patch coverage or adjust the target coverage.
❌ Your project check has failed because the head coverage (62.40%) is below the target coverage (70.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@             Coverage Diff             @@
##           16/edge    #1049      +/-   ##
===========================================
- Coverage    64.87%   62.40%   -2.47%     
===========================================
  Files           17       17              
  Lines         4270     4272       +2     
  Branches       656      655       -1     
===========================================
- Hits          2770     2666     -104     
- Misses        1333     1440     +107     
+ Partials       167      166       -1     

@taurus-forever taurus-forever changed the title Use Patroni API for is_restart_pending() [DPE-7726] Use Patroni API for is_restart_pending() (instead of SQL select from pg_settings) Jul 15, 2025
@taurus-forever taurus-forever force-pushed the alutay/is_restart_pending branch 5 times, most recently from 9c17d76 to e31de74 Compare August 2, 2025 00:40
The previous is_restart_pending() waited a long time because of Patroni's
loop_wait default value (10 seconds), which sets how long
Patroni waits before checking the configuration file again to reload it.

Instead of checking PostgreSQL pending_restart from pg_settings,
let's check the Patroni API pending_restart=True flag.
The current Patroni 3.2.2 has weird/flickering behaviour:
it temporarily sets pending_restart=True on many changes made through the REST API;
the flag is gone within a second, but that is long enough to be caught by the charm.
Sleeping a bit is a necessary evil until the Patroni 3.3.0 upgrade.
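
One way to picture the "sleep a bit" workaround is a double read of the flag: if the first read says a restart is pending, wait briefly and read again, so a flag that flickers for under a second is ignored. This is only a sketch; the function name `is_restart_pending_stable`, the `read_flag` callable, and the one-second `settle_seconds` default are assumptions, not values taken from this PR.

```python
import time
from typing import Callable


def is_restart_pending_stable(
    read_flag: Callable[[], bool],
    settle_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Report a pending restart only if the flag survives a short wait."""
    if not read_flag():
        # No flag on the first read: nothing to confirm, no sleep needed.
        return False
    # The flag may be a transient flicker, so wait and read it once more.
    sleep(settle_seconds)
    return read_flag()
```

The common case (no pending restart) returns immediately, so the delay is paid only when the flag is actually seen.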

The previous code slept for 15 seconds waiting for the pg_settings update.

Also, unnecessary restarts could be triggered by a mismatch between the
Patroni config file and in-memory changes coming from the REST API,
e.g. the slots were undefined in the yaml file but set as an empty JSON {} => None.
Update the default template to match the default API PATCHes and avoid restarts.
On a topology observer event, the primary unit used to lose its Primary label.
@taurus-forever taurus-forever force-pushed the alutay/is_restart_pending branch from e31de74 to 1703639 Compare August 13, 2025 23:16
Also:
* use the common logger everywhere
* add several useful log messages (e.g. DB connection)
* remove the no longer necessary debug 'Init class PostgreSQL'
* align the Patroni API request style everywhere
* add Patroni API duration to debug logs
The list of IPs was randomly sorted, causing unnecessary Patroni
configuration re-generation followed by a Patroni restart/reload.
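
To illustrate why ordering matters here: if the rendered configuration is compared with the previous one to decide on a restart/reload, the same set of IPs in a different order looks like a change. The sketch below sorts the addresses before rendering; `render_peer_list` and the YAML-like output shape are hypothetical and do not reflect the charm's real template.

```python
from typing import Iterable


def render_peer_list(peer_ips: Iterable[str]) -> str:
    """Render peer IPs as a YAML-like list in a stable (sorted) order.

    Sorting is plain string ordering; the goal is determinism,
    not numeric ordering of the addresses.
    """
    return "\n".join(f"- {ip}" for ip in sorted(peer_ips))


def config_changed(previous: str, peer_ips: Iterable[str]) -> bool:
    """Return True only when the rendered peer list really differs."""
    return render_peer_list(peer_ips) != previous
```

With this, `config_changed` stays False when the same IPs arrive in a different order.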
…hanged

Those defers are necessary to support scale-up/scale-down during the refresh,
but they significantly slow down PostgreSQL 16 bootstrap (and other
day-to-day maintenance tasks, like re-scaling, full node reboot/recovery, etc).

Mute them for now, with a proper documentation record
forbidding rescaling during the refresh, until we minimise the number of defers in PG16.
Throw a warning so that we recall this promise.
The current PG16 logic relies on Juju update-status or on_topology_change
observer events, while in some cases we start Patroni without the Observer,
causing a long wait until the next update-status arrives.
It is hard (impossible?) to catch the Juju Primary label
manipulations from Juju debug-log. Logging them simplifies troubleshooting.
We had to wait 30 seconds when there was no connection, which is unnecessarily long.

Also, add details about the reason for a failed connection (Retry/CannotConnect).
It speeds up single-unit app deployments.
@taurus-forever taurus-forever force-pushed the alutay/is_restart_pending branch from 1703639 to ee8e44b Compare August 13, 2025 23:20